[WIP] A Hitchhiker’s Guide to LLVM

Just Another Day in the Life of a SSA Variable

Compilers
LLVM
Author

Swayam Singh

Published

June 3, 2025


Note

This guide was written around the release of LLVM version 21.0.0

Building clang and LLVM

Dependencies (because why not?)

Package     Version        Notes
CMake       >=3.20.0       Makefile/workspace generator
python      >=3.8          Automated test suite
zlib        >=1.2.3.4      Compression library
GNU Make    3.79, 3.79.1   Makefile/build processor
PyYAML      >=5.1          Header generator

Building clang (trust me you’ll need it)

Clang is an LLVM-based compiler driver: it provides the frontend and invokes the right tools in the LLVM backend to compile C-like languages.

git clone https://github.com/llvm/llvm-project.git
cd llvm-project
cmake -DLLVM_ENABLE_PROJECTS=clang -GNinja -DCMAKE_BUILD_TYPE=Release llvm
ninja clang
Tip

How do I know these project names, you may ask? Check out the llvm-project/llvm/CMakeLists.txt file, search for LLVM_ALL_PROJECTS, and read the comments above it.
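
For example, from the root of the llvm-project checkout, something along these lines will show the list along with the comments around it:

grep -n -B3 "LLVM_ALL_PROJECTS" llvm/CMakeLists.txt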

Alternatively, the instructions are also available here: https://clang.llvm.org/get_started.html

Building LLVM (yeah the dragon itself)

mkdir build
cd build
cmake -G Ninja \
  -DCMAKE_BUILD_TYPE=Debug \
  -DLLVM_ENABLE_PROJECTS="all" \
  -DLLVM_OPTIMIZED_TABLEGEN=1 \
  ../llvm
ninja # this will build all the targets

Notably, some of the major targets you'll find inside build/bin are:

  • opt: Driver for performing optimizations on LLVM IR, i.e. LLVM IR => optimized LLVM IR
  • llc: Driver for transforming LLVM IR into assembly or an object file
  • llvm-mc: Driver to interact with assembled and disassembled objects
  • check: Target to run the tests (a ninja target rather than a binary)
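
For a quick feel of the first two, here is a sketch (it assumes a test.ll file in the current directory; instcombine and the RISC-V triple are just example choices):

# LLVM IR => optimized LLVM IR
./bin/opt -S -passes=instcombine test.ll -o test.opt.ll

# LLVM IR => target assembly
./bin/llc -mtriple=riscv64 test.opt.ll -o test.s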

Let's quickly run the check target with ninja; it will automatically run the test suite

ninja check

# or specific target
ninja check-<target-name>

# can print all the targets available by
ninja help

LLVM also has the llvm-lit tool, which runs the specified tests

./bin/llvm-lit test/CodeGen/RISCV/GlobalISel

# Output
-- Testing: 403 tests, 96 workers --
PASS: LLVM :: CodeGen/RISCV/GlobalISel/regbankselect/fp-arith-f16.mir (1 of 403)
PASS: LLVM :: CodeGen/RISCV/GlobalISel/legalizer/legalize-bswap-rv64.mir (2 of 403)
PASS: LLVM :: CodeGen/RISCV/GlobalISel/irtranslator/calls.ll (3 of 403)
[...]

There is also a tool inside LLVM known as FileCheck; it basically governs the verification of test outputs. I won't be going in depth, but the idea is: you write the expected outputs within comments inside the IR (or any other file), and lit, when running the test, invokes FileCheck for verification. Here is a small example:

; RUN: opt < %s -passes=mem2reg | FileCheck %s

define i32 @example() {
  ; CHECK-LABEL: @example(
  ; CHECK-NOT: alloca
  ; CHECK: ret i32 42

  %x = alloca i32
  store i32 42, ptr %x
  %result = load i32, ptr %x
  ret i32 %result
}

In this case, FileCheck will verify that the optimized IR still has the label @example, that the alloca instruction was removed, and that the function directly returns 42 as an i32.
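
For reference, after mem2reg the function should collapse to roughly:

define i32 @example() {
  ret i32 42
}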

We can also quickly verify this on our own. Create a test.ll file with the exact same content as shown above; since we already built LLVM, we have the required tools. Run the following commands:

# This will output the optimized IR (you can verify the pattern yourself)
./bin/opt -S < ../../tweaks/test.ll -passes=mem2reg
# Here we are piping the optimized IR as stdin to FileCheck
./bin/opt -S < ../../tweaks/test.ll -passes=mem2reg | ./bin/FileCheck ../../tweaks/test.ll  # no output means success
Note
  • ./bin/opt < ../../tweaks/test.ll -passes=mem2reg: will emit bitcode by default; use -S to get the textual representation
  • -passes=mem2reg just tells the opt driver which optimization pass to apply
Tip

Tools like lit and FileCheck are built by the LLVM folks, but they are general enough to be used for other projects and languages as well. Semicolon (;) comments are specific to LLVM IR, but these tools can work with any language and its comment style.
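
For instance, a lit test written in Python could carry the same directives in # comments (a sketch; it assumes a lit configuration that picks up .py files):

# RUN: python %s | FileCheck %s
# CHECK: hello, world
print("hello, world")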

Read more about them and their usage: https://llvm.org/docs/CommandGuide/lit.html and https://llvm.org/docs/CommandGuide/FileCheck.html

Warning

You should be thinking of a question at the moment (at least I was):

The optimizer can reorder instructions, so how would these checks stay valid in that case? Well, short answer: go to the docs and read about CHECK-DAG.
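
As a taste: CHECK-DAG directives match in any order, so a hypothetical pair of checks like the following passes even if the optimizer swaps the two independent instructions:

; CHECK-DAG: %sum = add i32
; CHECK-DAG: %prod = mul i32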

LLVM tests live in two places inside each project:

  • unittests: typical tests written using the gtest suite
  • test: the lit tests; we already saw the example of a typical lit test in llvm above

LLVM-Project directory tree

  • The high-level codebase division is organized among projects: mlir, clang, lldb, openmp, etc., which are also what we pass to -DLLVM_ENABLE_PROJECTS to specify the targets during the build
  • Each project is further organized as:
    • lib: Contains all the libraries; for the llvm project these include CodeGen, Analysis, Linker, etc.
    • include/<project>: These are the exposed public headers
      • The structure here mirrors lib: you can see a folder for each library
    • tools: Contains specific tools for llvm, like xcode-toolchain, llc, etc.
    • unittests: Gtests
    • test: llvm-lit tests
    • utils: Utility tools like FileCheck and llvm-lit
    Note

    For any library inside the lib folder, you can find the corresponding

    • include/<project>/<lib>
    • unittests/<lib>
    • test/<lib>
  • To gather the public and private headers for some <project>/<lib> (see the concrete example after this list):
    • Public headers: <project>/include/<project>/<lib>/<file>
    • Private headers: <project>/lib/<lib>/<file>
    • The paths inside a code file's include directives can be identified as:
      • Public: <project>/<lib>/<file>
      • Private: <file>
  • llvm project specifics:
    • LLVM IR
      • It mainly lives inside lib/IR
      • Its optimization specifics are available at lib/Analysis and lib/Transforms
      • Binary and textual representations are handled in lib/Bitcode, lib/IRReader and lib/IRPrinter
    • General backend code generation is in lib/CodeGen
    • Target-specific backend code generation: lib/Target/<backend>
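
As a concrete example, take LoopInfo from the llvm project's Analysis library:

  • Public header: llvm/include/llvm/Analysis/LoopInfo.h
  • Implementation: llvm/lib/Analysis/LoopInfo.cpp
  • Include directive in code: #include "llvm/Analysis/LoopInfo.h"
  • Tests: llvm/unittests/Analysis/ and llvm/test/Analysis/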

Some Important Compiler Lingo

Build Time:

Time taken to build the compiler; the time it takes for make or ninja to complete the build of LLVM

Compile Time:

Time taken to convert a source file to the object file by the compiler

Runtime:

Time taken to execute the binary file, i.e. the time it takes to run the binary produced by clang

The runtime of the compiler is basically the compile time of the application

Canonical form

A recommended way of representing expressions or a piece of code.

a = b + 2;
// OR
a = 2 + b;

Agreeing on a canonical form means that the compiler will strive to generate equivalent expressions in only one way; passes then only need to recognize that one form, and a non-canonical form can slip past optimizations and impact performance.

Middle-End

This term is usually lumped in with the backend, but to be specific, this is where the target-agnostic transformations of the IR happen.

Backend

Target-Specific transformations on the IR

Note

Three levels in LLVM

  • LLVM IR (High-level, target-independent)
  • Machine IR (Low-Level, target-specific, still IR)
  • Assembly/Machine Code (Final output)
LLVM IR              Machine IR              Assembly
-------              ----------              --------
define i32 @add  →   MachineFunction    →    add:
%result = add    →   ADD32 %vreg1       →      mov eax, edi
ret i32 %result  →   RET %vreg1         →      add eax, esi
                                        →      ret

Application Binary Interface (ABI)

It can be tricky to understand at first, so think of it as a protocol that defines exactly how functions communicate with each other at the machine-code level. Consider two functions:

int add(int a, int b) {
    return a + b;
}

int main() {
    int result = add(5, 10);
    return result;
}

When main calls add, several low-level questions need answering:

  • Where does main put the values 5 and 10 so that add can find them?
  • Where does add put the result 15 so that main can retrieve it?
  • How is the function call actually made at the assembly level?

An ABI provides these specific answers. For example, on x86-64 a common ABI (the System V ABI used on Linux) specifies:

  • The first int argument goes in the RDI register
  • The second int argument goes in the RSI register
  • The return value goes in the RAX register
  • The stack must be 16-byte aligned before calls
mov rdi, 5    ; first argument for add
mov rsi, 10   ; second argument for add
call add      ; result comes back in rax
Note
  • ABI \(\neq\) assembly: targets have their ABIs, and assembly instructions are used to implement them
  • You must define your own ABI when writing your own backend; for interoperability, it is recommended to use an existing one if available
  • One target can have multiple possible ABIs, but for a particular calling convention there is one matching ABI
  • If different compilers follow the same ABI rules, then a function compiled by one can interact with a function compiled by the other
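
To see the ABI in action, compile the add function from above and inspect the assembly. A sketch (exact output varies with compiler version; this assumes x86-64 Linux, i.e. the System V ABI):

clang -O1 -S add.c -o -
# add() comes out roughly as:
#   leal (%rdi,%rsi), %eax    # a in %edi, b in %esi, result in %eax
#   retq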

Encoding

Encoding is the specification of how to translate assembly instructions into the actual 1s and 0s that the CPU reads. Let’s say we want to encode add eax, ebx

The x86 encoding might be:

Binary:    00000001 11011000
Hex:       01 D8

The ISA document of a particular target tells you exactly what each bit means:

First Byte: 01 (Opcode)

01 = ADD instruction with format “r/m32, r32” (meaning: destination=r/m, source=reg)

Second Byte: D8 (ModR/M byte)

Binary: 11011000

Breaking down the ModR/M byte:

  • Bits 7-6 (MOD): 11 = both operands are registers (not memory)
  • Bits 5-3 (REG): 011 = EBX register (source)
  • Bits 2-0 (R/M): 000 = EAX register (destination)

So 01 D8 means: “ADD the value in EBX (reg field) to EAX (r/m field)”
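
We can also verify this with the llvm-mc tool built earlier (note the AT&T operand order: addl %ebx, %eax means eax += ebx):

echo "addl %ebx, %eax" | ./bin/llvm-mc -triple=x86_64 --show-encoding
# Output includes:
#   addl %ebx, %eax    # encoding: [0x01,0xd8]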

Misc

Virtual and Physical registers

Physical registers are the actual registers that exist in your CPU:

x86-64 has: RAX, RBX, RCX, RDX, RSI, RDI, R8-R15, etc.
ARM has: R0-R15, etc.

They are limited in number (x86-64 has ~16 general-purpose registers) and they have specific names.

Virtual registers, on the other hand, are temporary names created by the compiler:

%vreg0, %vreg1, %vreg2, %vreg3, %vreg4, %vreg5, ...

They can be unlimited, they don't correspond to real hardware, and they are generically numbered. Machine IR uses virtual registers; in a later step of compilation (register allocation), they get mapped to physical registers.
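
You can see virtual registers for yourself by stopping llc right after instruction selection and printing the Machine IR. A sketch (pass names such as finalize-isel can change across LLVM versions):

./bin/llc -mtriple=x86_64 -stop-after=finalize-isel test.ll -o -
# The printed MIR uses numbered virtual registers, e.g.:
#   %0:gr32 = MOV32ri 42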

Structures & Relations within LLVM programs

In LLVM processing, a typical program is divided into the following structure, in a top-down manner:

  • Modules
  • Functions
  • Basic Blocks
  • Instructions
    • Definition
    • Arguments / operands
    • Operation code / opcode

The following are the ways these components or structures are connected:

  • Control Flow Graphs (CFG)
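
We will unpack these components below, but as a quick taste, opt can already dump a function's CFG as Graphviz dot files (one .dot file per function, written into the current directory):

./bin/opt -passes=dot-cfg test.ll -disable-output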

Modules:

A module is essentially the container for everything the compiler is working on at a given time. Think of it as:

Input File (e.g., main.c) → LLVM Module

The module contains:

  • All function definitions (main(), foo(), etc.)
  • Global variables
  • Metadata
  • Everything else needed to compile that input

It is also known as a Translation Unit or a Compilation Unit; all refer to the same concept: the unit of compilation.
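
Here is a minimal hand-written module tying these pieces together (a sketch, not generated from any particular main.c):

; ModuleID = 'main.c'
@counter = global i32 0                ; a global variable

define i32 @add(i32 %a, i32 %b) {      ; a function definition
entry:                                 ; a basic block
  %sum = add i32 %a, %b                ; an instruction
  ret i32 %sum
}

define i32 @main() {
entry:
  %r = call i32 @add(i32 5, i32 10)
  ret i32 %r
}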